{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# An Introduction to SageMaker ObjectToVec model for MovieLens recommendation\n",
    "\n",
    "\n",
    "1. [Background](#Background)\n",
    "1. [Data exploration and preparation](#Data-exploration-and-preparation)\n",
    "1. [Rating prediction task](#Rating-prediction-task)\n",
    "1. [Recommendation task](#Recommendation-task)\n",
    "1. [Movie retrieval in the embedding space](#Movie-retrieval-in-the-embedding-space)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Background\n",
    "\n",
    "### ObjectToVec\n",
    "*Object2Vec* is a highly customizable multi-purpose algorithm that can learn embeddings of pairs of objects. The embeddings are learned such that it preserves their pairwise **similarities** in the original space.\n",
    "- **Similarity** is user-defined: users need to provide the algorithm with pairs of objects that they define as similar (1) or dissimilar (0); alternatively, the users can define similarity in a continuous sense (provide a real-valued similarity score)\n",
    "- The learned embeddings can be used to efficiently compute nearest neighbors of objects, as well as to visualize natural clusters of related objects in the embedding space. In addition, the embeddings can also be used as features of the corresponding objects in downstream supervised tasks such as classification or regression"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## In this notebook example:\n",
    "We demonstrate how Object2Vec can be used to solve problems arising in recommendation systems. Specifically,\n",
    "\n",
    "- We provide the algorithm with (UserID, MovieID) pairs; for each such pair, we also provide a \"label\" that tells the algorithm whether the user and movie are similar or not\n",
    "\n",
    "     * When the labels are real-valued, we use the algorithm to predict the exact ratings of a movie given a user\n",
    "     * When the labels are binary, we use the algorithm to recommendation movies to users\n",
    "\n",
    "- The diagram below shows the customization of our model to the problem of predicting movie ratings, using a dataset that provides `(UserID, ItemID, Rating)` samples. Here, ratings are real-valued"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img style=\"float:middle\" src=\"image_ml_rating.png\" width=\"480\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Dataset\n",
    "- We use the MovieLens 100k dataset: https://grouplens.org/datasets/movielens/100k/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Use cases\n",
    "\n",
    "- Task 1: Rating prediction (regression)\n",
    "- Task 2: Movie recommendation (classification)\n",
    "- Task 3: Nearest-neighbor movie retrieval in the learned embedding space"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Before running the notebook\n",
    "- Please use a Python 3 kernel for the notebook\n",
    "- Please make sure you have `jsonlines` package installed (if not, you can run the command below to install it)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install jsonlines"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import sys\n",
    "import csv, jsonlines\n",
    "import numpy as np\n",
    "import copy\n",
    "import random"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data exploration and preparation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### License\n",
    "Please be aware of the following requirements about ackonwledgment, copyright and availability, cited from the [data set description page](http://files.grouplens.org/datasets/movielens/ml-100k-README.txt).\n",
    ">The data set may be used for any research\n",
    "purposes under the following conditions:\n",
    "     * The user may not state or imply any endorsement from the\n",
    "       University of Minnesota or the GroupLens Research Group.\n",
    "     * The user must acknowledge the use of the data set in\n",
    "       publications resulting from the use of the data set\n",
    "       (see below for citation information).\n",
    "     * The user may not redistribute the data without separate\n",
    "       permission.\n",
    "     * The user may not use this information for any commercial or\n",
    "       revenue-bearing purposes without first obtaining permission\n",
    "       from a faculty member of the GroupLens Research Project at the\n",
    "       University of Minnesota.\n",
    "If you have any further questions or comments, please contact GroupLens \\<grouplens-info@cs.umn.edu\\>. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "curl -o ml-100k.zip http://files.grouplens.org/datasets/movielens/ml-100k.zip\n",
    "unzip ml-100k.zip\n",
    "rm ml-100k.zip"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's first create some utility functions for data exploration and preprocessing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## some utility functions\n",
    "\n",
    "def load_csv_data(filename, delimiter, verbose=True):\n",
    "    \"\"\"\n",
    "    input: a file readable as csv and separated by a delimiter\n",
    "    and has format users - movies - ratings - etc\n",
    "    output: a list, where each row of the list is of the form\n",
    "    {'in0':userID, 'in1':movieID, 'label':rating}\n",
    "    \"\"\"\n",
    "    to_data_list = list()\n",
    "    users = list()\n",
    "    movies = list()\n",
    "    ratings = list()\n",
    "    unique_users = set()\n",
    "    unique_movies = set()\n",
    "    with open(filename, 'r') as csvfile:\n",
    "        reader = csv.reader(csvfile, delimiter=delimiter)\n",
    "        for count, row in enumerate(reader):\n",
    "            #if count!=0:\n",
    "            to_data_list.append({'in0':[int(row[0])], 'in1':[int(row[1])], 'label':float(row[2])})\n",
    "            users.append(row[0])\n",
    "            movies.append(row[1])\n",
    "            ratings.append(float(row[2]))\n",
    "            unique_users.add(row[0])\n",
    "            unique_movies.add(row[1])\n",
    "    if verbose:\n",
    "        print(\"In file {}, there are {} ratings\".format(filename, len(ratings)))\n",
    "        print(\"The ratings have mean: {}, median: {}, and variance: {}\".format(\n",
    "                                            round(np.mean(ratings), 2), \n",
    "                                            round(np.median(ratings), 2), \n",
    "                                            round(np.var(ratings), 2)))\n",
    "        print(\"There are {} unique users and {} unique movies\".format(len(unique_users), len(unique_movies)))\n",
    "    return to_data_list\n",
    "\n",
    "\n",
    "def csv_to_augmented_data_dict(filename, delimiter):\n",
    "    \"\"\"\n",
    "    Input: a file that must be readable as csv and separated by delimiter (to make columns)\n",
    "    has format users - movies - ratings - etc\n",
    "    Output:\n",
    "      Users dictionary: keys as user ID's; each key corresponds to a list of movie ratings by that user\n",
    "      Movies dictionary: keys as movie ID's; each key corresponds a list of ratings of that movie by different users\n",
    "    \"\"\"\n",
    "    to_users_dict = dict() \n",
    "    to_movies_dict = dict()\n",
    "    with open(filename, 'r') as csvfile:\n",
    "        reader = csv.reader(csvfile, delimiter=delimiter)\n",
    "        for count, row in enumerate(reader):\n",
    "            #if count!=0:\n",
    "            if row[0] not in to_users_dict:\n",
    "                to_users_dict[row[0]] = [(row[1], row[2])]\n",
    "            else:\n",
    "                to_users_dict[row[0]].append((row[1], row[2]))\n",
    "            if row[1] not in to_movies_dict:\n",
    "                to_movies_dict[row[1]] = list(row[0])\n",
    "            else:\n",
    "                to_movies_dict[row[1]].append(row[0])\n",
    "    return to_users_dict, to_movies_dict\n",
    "\n",
    "\n",
    "def user_dict_to_data_list(user_dict):\n",
    "    # turn user_dict format to data list format (acceptable to the algorithm)\n",
    "    data_list = list()\n",
    "    for user, movie_rating_list in user_dict.items():\n",
    "        for movie, rating in movie_rating_list:\n",
    "            data_list.append({'in0':[int(user)], 'in1':[int(movie)], 'label':float(rating)})\n",
    "    return data_list\n",
    "\n",
    "def divide_user_dicts(user_dict, sp_ratio_dict):\n",
    "    \"\"\"\n",
    "    Input: A user dictionary, a ration dictionary\n",
    "         - format of sp_ratio_dict = {'train':0.8, \"test\":0.2}\n",
    "    Output: \n",
    "        A dictionary of dictionaries, with key corresponding to key provided by sp_ratio_dict\n",
    "        and each key corresponds to a subdivded user dictionary\n",
    "    \"\"\"\n",
    "    ratios = [val for _, val in sp_ratio_dict.items()]\n",
    "    assert np.sum(ratios) == 1, \"the sampling ratios must sum to 1!\"\n",
    "    divided_dict = {}\n",
    "    for user, movie_rating_list in user_dict.items():\n",
    "        sub_movies_ptr = 0\n",
    "        sub_movies_list = []\n",
    "        #movie_list, _ = zip(*movie_rating_list)\n",
    "        #print(movie_list)\n",
    "        for i, ratio in enumerate(ratios):\n",
    "            if i < len(ratios)-1:\n",
    "                sub_movies_ptr_end = sub_movies_ptr + int(len(movie_rating_list)*ratio)\n",
    "                sub_movies_list.append(movie_rating_list[sub_movies_ptr:sub_movies_ptr_end])\n",
    "                sub_movies_ptr = sub_movies_ptr_end\n",
    "            else:\n",
    "                sub_movies_list.append(movie_rating_list[sub_movies_ptr:])\n",
    "        for subset_name in sp_ratio_dict.keys():\n",
    "            if subset_name not in divided_dict:\n",
    "                divided_dict[subset_name] = {user: sub_movies_list.pop(0)}\n",
    "            else:\n",
    "                #access sub-dictionary\n",
    "                divided_dict[subset_name][user] = sub_movies_list.pop(0)\n",
    "    \n",
    "    return divided_dict\n",
    "\n",
    "def write_csv_to_jsonl(jsonl_fname, csv_fname, csv_delimiter):\n",
    "    \"\"\"\n",
    "    Input: a file readable as csv and separated by delimiter (to make columns)\n",
    "        - has format users - movies - ratings - etc\n",
    "    Output: a jsonline file converted from the csv file\n",
    "    \"\"\"\n",
    "    with jsonlines.open(jsonl_fname, mode='w') as writer:\n",
    "        with open(csv_fname, 'r') as csvfile:\n",
    "            reader = csv.reader(csvfile, delimiter=csv_delimiter)\n",
    "            for count, row in enumerate(reader):\n",
    "                #print(row)\n",
    "                #if count!=0:\n",
    "                writer.write({'in0':[int(row[0])], 'in1':[int(row[1])], 'label':float(row[2])})\n",
    "        print('Created {} jsonline file'.format(jsonl_fname))\n",
    "                    \n",
    "    \n",
    "def write_data_list_to_jsonl(data_list, to_fname):\n",
    "    \"\"\"\n",
    "    Input: a data list, where each row of the list is a Python dictionary taking form\n",
    "    {'in0':userID, 'in1':movieID, 'label':rating}\n",
    "    Output: save the list as a jsonline file\n",
    "    \"\"\"\n",
    "    with jsonlines.open(to_fname, mode='w') as writer:\n",
    "        for row in data_list:\n",
    "            #print(row)\n",
    "            writer.write({'in0':row['in0'], 'in1':row['in1'], 'label':row['label']})\n",
    "    print(\"Created {} jsonline file\".format(to_fname))\n",
    "\n",
    "def data_list_to_inference_format(data_list, binarize=True, label_thres=3):\n",
    "    \"\"\"\n",
    "    Input: a data list\n",
    "    Output: test data and label, acceptable by SageMaker for inference\n",
    "    \"\"\"\n",
    "    data_ = [({\"in0\":row['in0'], 'in1':row['in1']}, row['label']) for row in data_list]\n",
    "    data, label = zip(*data_)\n",
    "    infer_data = {\"instances\":data}\n",
    "    if binarize:\n",
    "        label = get_binarized_label(list(label), label_thres)\n",
    "    return infer_data, label\n",
    "\n",
    "\n",
    "def get_binarized_label(data_list, thres):\n",
    "    \"\"\"\n",
    "    Input: data list\n",
    "    Output: a binarized data list for recommendation task\n",
    "    \"\"\"\n",
    "    for i, row in enumerate(data_list):\n",
    "        if type(row) is dict:\n",
    "            #if i < 10:\n",
    "                #print(row['label'])\n",
    "            if row['label'] > thres:\n",
    "                #print(row)\n",
    "                data_list[i]['label'] = 1\n",
    "            else:\n",
    "                data_list[i]['label'] = 0\n",
    "        else:\n",
    "            if row > thres:\n",
    "                data_list[i] = 1\n",
    "            else:\n",
    "                data_list[i] = 0\n",
    "    return data_list\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## Load data and shuffle\n",
    "prefix = 'ml-100k'\n",
    "train_path = os.path.join(prefix, 'ua.base')\n",
    "valid_path = os.path.join(prefix, 'ua.test')\n",
    "test_path = os.path.join(prefix, 'ub.test')\n",
    "\n",
    "train_data_list = load_csv_data(train_path, '\\t')\n",
    "random.shuffle(train_data_list)\n",
    "validation_data_list = load_csv_data(valid_path, '\\t')\n",
    "random.shuffle(validation_data_list)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "to_users_dict, to_movies_dict = csv_to_augmented_data_dict(train_path, '\\t')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### We perform some data exploration"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## Calculate min, max, median of number of movies per user\n",
    "movies_per_user = [len(val) for key, val in to_users_dict.items()]\n",
    "\n",
    "print(\"The min, max, and median 'movies per user' is {}, {}, and {}\".format(np.amin(movies_per_user),\n",
    "                                                                         np.amax(movies_per_user),\n",
    "                                                                         np.median(movies_per_user)))\n",
    "users_per_movie = [len(val) for key, val in to_movies_dict.items()]\n",
    "print(\"The min, max, and median 'users per movie' is {}, {}, and {}\".format(np.amin(users_per_movie),\n",
    "                                                                         np.amax(users_per_movie),\n",
    "                                                                          np.median(users_per_movie)))\n",
    "\n",
    "\n",
    "count = 0\n",
    "n_movies_lower_bound = 20\n",
    "for n_movies in movies_per_user:\n",
    "    if n_movies <= n_movies_lower_bound:\n",
    "        count += 1\n",
    "print(\"In the training set\")\n",
    "print('There are {} users with no more than {} movies'.format(count, n_movies_lower_bound))\n",
    "#\n",
    "count = 0\n",
    "n_users_lower_bound = 2\n",
    "for n_users in users_per_movie:\n",
    "    if n_users <= n_users_lower_bound:\n",
    "        count += 1\n",
    "print('There are {} movies with no more than {} user'.format(count, n_users_lower_bound))\n",
    "\n",
    "\n",
    "## figures\n",
    "\n",
    "f = plt.figure(1)\n",
    "plt.hist(movies_per_user)\n",
    "plt.title(\"Movies per user\")\n",
    "##\n",
    "g = plt.figure(2)\n",
    "plt.hist(users_per_movie)\n",
    "plt.title(\"Users per movie\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since the number of movies with an extremely small number of users (<3) is negligible compared to the total number of movies, we will not remove movies from the data set (same applies for users) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## Save training and validation data locally for rating-prediction (regression) task\n",
    "\n",
    "write_data_list_to_jsonl(copy.deepcopy(train_data_list), 'train_r.jsonl')\n",
    "write_data_list_to_jsonl(copy.deepcopy(validation_data_list), 'validation_r.jsonl')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## Save training and validation data locally for recommendation (classification) task\n",
    "\n",
    "### binarize the data \n",
    "\n",
    "train_c = get_binarized_label(copy.deepcopy(train_data_list), 3.0)\n",
    "valid_c = get_binarized_label(copy.deepcopy(validation_data_list), 3.0)\n",
    "\n",
    "write_data_list_to_jsonl(train_c, 'train_c.jsonl')\n",
    "write_data_list_to_jsonl(valid_c, 'validation_c.jsonl')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**We check whether the two classes are balanced after binarization**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_c_label = [row['label'] for row in train_c]\n",
    "valid_c_label = [row['label'] for row in valid_c]\n",
    "\n",
    "print(\"There are {} fraction of positive ratings in train_c.jsonl\".format(\n",
    "                                np.count_nonzero(train_c_label)/len(train_c_label)))\n",
    "print(\"There are {} fraction of positive ratings in validation_c.jsonl\".format(\n",
    "                                np.sum(valid_c_label)/len(valid_c_label)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Rating prediction task "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_mse_loss(res, labels):\n",
    "    if type(res) is dict:\n",
    "        res = res['predictions']\n",
    "    assert len(res)==len(labels), 'result and label length mismatch!'\n",
    "    loss = 0\n",
    "    for row, label in zip(res, labels):\n",
    "        if type(row)is dict:\n",
    "            loss += (row['scores'][0]-label)**2\n",
    "        else:\n",
    "            loss += (row-label)**2\n",
    "    return round(loss/float(len(labels)), 2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "valid_r_data, valid_r_label = data_list_to_inference_format(copy.deepcopy(validation_data_list), binarize=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### We first test the problem on two baseline algorithms"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Baseline 1\n",
    "\n",
    "A naive approach to predict movie ratings on unseen data is to use the global average of the user predictions in the training data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_r_label = [row['label'] for row in copy.deepcopy(train_data_list)]\n",
    "\n",
    "bs1_prediction = round(np.mean(train_r_label), 2)\n",
    "print('The Baseline 1 (global rating average) prediction is {}'.format(bs1_prediction))\n",
    "print(\"The validation mse loss of the Baseline 1 is {}\".format(\n",
    "                                     get_mse_loss(len(valid_r_label)*[bs1_prediction], valid_r_label)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Baseline 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we use a better baseline, which is to perform prediction on unseen data based on the user-averaged ratings of movies on training data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def bs2_predictor(test_data, user_dict, is_classification=False, thres=3):\n",
    "    test_data = copy.deepcopy(test_data['instances'])\n",
    "    predictions = list()\n",
    "    for row in test_data:\n",
    "        userID = str(row[\"in0\"][0])\n",
    "        # predict movie ID based on local average of user's prediction\n",
    "        local_movies, local_ratings = zip(*user_dict[userID])\n",
    "        local_ratings = [float(score) for score in local_ratings]\n",
    "        predictions.append(np.mean(local_ratings))\n",
    "        if is_classification:\n",
    "            predictions[-1] = int(predictions[-1] > 3)\n",
    "    return predictions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bs2_prediction = bs2_predictor(valid_r_data, to_users_dict, is_classification=False)\n",
    "print(\"The validation loss of the Baseline 2 (user-based rating average) is {}\".format(\n",
    "                                     get_mse_loss(bs2_prediction, valid_r_label)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we will use *Object2Vec* to predict the movie ratings"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Model training and inference"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Define S3 bucket that hosts data and model, and upload data to S3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import boto3 \n",
    "import os\n",
    " \n",
    "bucket = '<Your bucket name>' # Customize your own bucket name\n",
    "input_prefix = 'object2vec/movielens/input'\n",
    "output_prefix = 'object2vec/movielens/output'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Upload data to S3 and make data paths"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.session import s3_input\n",
    "\n",
    "s3_client = boto3.client('s3')\n",
    "input_paths = {}\n",
    "output_path = os.path.join('s3://', bucket, output_prefix)\n",
    "\n",
    "for data_name in ['train', 'validation']:\n",
    "    pre_key = os.path.join(input_prefix, 'rating', f'{data_name}')\n",
    "    fname = '{}_r.jsonl'.format(data_name)\n",
    "    data_path = os.path.join('s3://', bucket, pre_key, fname)\n",
    "    s3_client.upload_file(fname, bucket, os.path.join(pre_key, fname))\n",
    "    input_paths[data_name] = s3_input(data_path, distribution='ShardedByS3Key', content_type='application/jsonlines')\n",
    "    print('Uploaded {} data to {} and defined input path'.format(data_name, data_path))\n",
    "\n",
    "print('Trained model will be saved at', output_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Get ObjectToVec algorithm image"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sagemaker\n",
    "from sagemaker import get_execution_role\n",
    "\n",
    "sess = sagemaker.Session()\n",
    "\n",
    "role = get_execution_role()\n",
    "print(role)\n",
    "\n",
    "## Get docker image of ObjectToVec algorithm\n",
    "from sagemaker.amazon.amazon_estimator import get_image_uri\n",
    "container = get_image_uri(boto3.Session().region_name, 'object2vec')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Training"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### We first define training hyperparameters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hyperparameters = {\n",
    "    \"_kvstore\": \"device\",\n",
    "    \"_num_gpus\": \"auto\",\n",
    "    \"_num_kv_servers\": \"auto\",\n",
    "    \"bucket_width\": 0,\n",
    "    \"early_stopping_patience\": 3,\n",
    "    \"early_stopping_tolerance\": 0.01,\n",
    "    \"enc0_cnn_filter_width\": 3,\n",
    "    \"enc0_layers\": \"auto\",\n",
    "    \"enc0_max_seq_len\": 1,\n",
    "    \"enc0_network\": \"pooled_embedding\",\n",
    "    \"enc0_token_embedding_dim\": 300,\n",
    "    \"enc0_vocab_size\": 944,\n",
    "    \"enc1_layers\": \"auto\",\n",
    "    \"enc1_max_seq_len\": 1,\n",
    "    \"enc1_network\": \"pooled_embedding\",\n",
    "    \"enc1_token_embedding_dim\": 300,\n",
    "    \"enc1_vocab_size\": 1684,\n",
    "    \"enc_dim\": 1024,\n",
    "    \"epochs\": 20,\n",
    "    \"learning_rate\": 0.001,\n",
    "    \"mini_batch_size\": 64,\n",
    "    \"mlp_activation\": \"tanh\",\n",
    "    \"mlp_dim\": 256,\n",
    "    \"mlp_layers\": 1,\n",
    "    \"num_classes\": 2,\n",
    "    \"optimizer\": \"adam\",\n",
    "    \"output_layer\": \"mean_squared_error\"\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## get estimator\n",
    "regressor = sagemaker.estimator.Estimator(container,\n",
    "                                          role, \n",
    "                                          train_instance_count=1, \n",
    "                                          train_instance_type='ml.p2.xlarge',\n",
    "                                          output_path=output_path,\n",
    "                                          sagemaker_session=sess)\n",
    "\n",
    "## set hyperparameters\n",
    "regressor.set_hyperparameters(**hyperparameters)\n",
    "\n",
    "## train the model\n",
    "regressor.fit(input_paths)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have seen that we can upload train (validation) data through the input data channel, and the algorithm will print out train (validation) evaluation metric during training. In addition, the algorithm uses the validation metric to perform early stopping. \n",
    "\n",
    "What if we want to send additional unlabeled data to the algorithm and get predictions from the trained model?\n",
    "This step is called *inference* in the Sagemaker framework. Next, we demonstrate how to use a trained model to perform inference on unseen data points."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Inference using trained model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create and deploy the model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#import numpy as np\n",
    "from sagemaker.predictor import json_serializer, json_deserializer\n",
    "\n",
    "# create a model using the trained algorithm\n",
    "regression_model = regressor.create_model(\n",
    "                        serializer=json_serializer,\n",
    "                        deserializer=json_deserializer,\n",
    "                        content_type='application/json')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# deploy the model\n",
    "predictor = regression_model.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Below we send validation data (without labels) to the deployed endpoint for inference. We will see that the resulting prediction error we get from post-training inference matches the best validation error from the training log in the console above (up to floating point error). If you follow the training instruction and parameter setup, you should get mean squared error on the validation set approximately 0.91."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Send data to the endpoint to get predictions\n",
    "prediction = predictor.predict(valid_r_data)\n",
    "\n",
    "print(\"The mean squared error on validation set is %.3f\" %get_mse_loss(prediction, valid_r_label))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Comparison against popular libraries\n",
    "\n",
    "Below we provide a chart that compares the performance of *Object2Vec* against several algorithms implemented by popular recommendation system libraries (LibRec https://www.librec.net/ and scikit-surprise http://surpriselib.com/). The error metric we use in the chart is **root mean squared** (RMSE) instead of MSE, so that our result can be compared against the reported results in the aforementioned libraries."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"ml-experiment-plot.png\" width=\"400\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Recommendation task \n",
    "\n",
    "In this section, we showcase how to use *Object2Vec* to recommend movies, using the binarized rating labels. Here, if a movie rating label for a given user is binarized to `1`, then it means that the movie should be recommended to the user; otherwise, the label is binarized to `0`. The binarized data set is already obtained in the preprocessing section, so we will proceed to apply the algorithm."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We upload the binarized datasets for classification task to S3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for data_name in ['train', 'validation']:\n",
    "    fname = '{}_c.jsonl'.format(data_name)\n",
    "    pre_key = os.path.join(input_prefix, 'recommendation', f\"{data_name}\")\n",
    "    data_path = os.path.join('s3://', bucket, pre_key, fname)\n",
    "    s3_client.upload_file(fname, bucket, os.path.join(pre_key, fname))\n",
    "    input_paths[data_name] = s3_input(data_path, distribution='ShardedByS3Key', content_type='application/jsonlines')\n",
    "    print('Uploaded data to {}'.format(data_path))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since we already get the algorithm image from the regression task, we can directly start training"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.session import s3_input\n",
    "\n",
    "hyperparameters_c = {\n",
    "    \"_kvstore\": \"device\",\n",
    "    \"_num_gpus\": \"auto\",\n",
    "    \"_num_kv_servers\": \"auto\",\n",
    "    \"bucket_width\": 0,\n",
    "    \"early_stopping_patience\": 3, \n",
    "    \"early_stopping_tolerance\": 0.01,\n",
    "    \"enc0_cnn_filter_width\": 3,\n",
    "    \"enc0_layers\": \"auto\",\n",
    "    \"enc0_max_seq_len\": 1,\n",
    "    \"enc0_network\": \"pooled_embedding\",\n",
    "    \"enc0_token_embedding_dim\": 300,\n",
    "    \"enc0_vocab_size\": 944,\n",
    "    \"enc1_cnn_filter_width\": 3,\n",
    "    \"enc1_layers\": \"auto\",\n",
    "    \"enc1_max_seq_len\": 1,\n",
    "    \"enc1_network\": \"pooled_embedding\",\n",
    "    \"enc1_token_embedding_dim\": 300,\n",
    "    \"enc1_vocab_size\": 1684,\n",
    "    \"enc_dim\": 2048,\n",
    "    \"epochs\": 20,\n",
    "    \"learning_rate\": 0.001,\n",
    "    \"mini_batch_size\": 2048,\n",
    "    \"mlp_activation\": \"relu\",\n",
    "    \"mlp_dim\": 1024,\n",
    "    \"mlp_layers\": 1,\n",
    "    \"num_classes\": 2,\n",
    "    \"optimizer\": \"adam\",\n",
    "    \"output_layer\": \"softmax\"\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## get estimator\n",
    "classifier = sagemaker.estimator.Estimator(container,\n",
    "                                    role, \n",
    "                                    train_instance_count=1, \n",
    "                                    train_instance_type='ml.p2.xlarge',\n",
    "                                    output_path=output_path,\n",
    "                                    sagemaker_session=sess)\n",
    "\n",
    "## set hyperparameters\n",
    "classifier.set_hyperparameters(**hyperparameters_c)\n",
    "\n",
    "## train, tune, and test the model\n",
    "classifier.fit(input_paths)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Again, we can create, deploy, and validate the model after training"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "classification_model = classifier.create_model(\n",
    "                        serializer=json_serializer,\n",
    "                        deserializer=json_deserializer,\n",
    "                        content_type='application/json')\n",
    "\n",
    "predictor_2 = classification_model.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "valid_c_data, valid_c_label = data_list_to_inference_format(copy.deepcopy(validation_data_list), \n",
    "                                                            label_thres=3, binarize=True)\n",
    "predictions = predictor_2.predict(valid_c_data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_class_accuracy(res, labels, thres):\n",
    "    if type(res) is dict:\n",
    "        res = res['predictions']\n",
    "    assert len(res)==len(labels), 'result and label length mismatch!'\n",
    "    accuracy = 0\n",
    "    for row, label in zip(res, labels):\n",
    "        if type(row) is dict:\n",
    "            if row['scores'][1] > thres:\n",
    "                prediction = 1\n",
    "            else: \n",
    "                prediction = 0\n",
    "            if label > thres:\n",
    "                label = 1\n",
    "            else:\n",
    "                label = 0\n",
    "            accuracy += 1 - (prediction - label)**2\n",
    "    return accuracy / float(len(res))\n",
    "\n",
    "print(\"The accuracy on the binarized validation set is %.3f\" %get_class_accuracy(predictions, valid_c_label, 0.5))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The accuracy on validation set you would get should be approximately 0.704."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Movie retrieval in the embedding space\n",
    "\n",
    "Since *Object2Vec* transforms user and movie ID's into embeddings as part of the training process. After training, it obtains user and movie embeddings in the left and right encoders, respectively. Intuitively, the embeddings should be tuned by the algorithm in a way that facilitates the supervised learning task: since for a specific user, similar movies should have similar ratings, we expect that similar movies should be **close-by** in the embedding space.\n",
    "\n",
    "In this section, we demonstrate how to find the nearest-neighbor (in Euclidean distance) of a given movie ID, among all movie ID's."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_movie_embedding_dict(movie_ids, trained_model):\n",
    "    input_instances = list()\n",
    "    for s_id in movie_ids:\n",
    "        input_instances.append({'in1': [s_id]})\n",
    "    data = {'instances': input_instances}\n",
    "    movie_embeddings = trained_model.predict(data)\n",
    "    embedding_dict = {}\n",
    "    for s_id, row in zip(movie_ids, movie_embeddings['predictions']):\n",
    "        embedding_dict[s_id] = np.array(row['embeddings'])\n",
    "    return embedding_dict\n",
    "\n",
    "\n",
    "def load_movie_id_name_map(item_file):\n",
    "    movieID_name_map = {}\n",
    "    with open(item_file, 'r', encoding=\"ISO-8859-1\") as f:\n",
    "        for row in f.readlines():\n",
    "            row = row.strip()\n",
    "            split = row.split('|')\n",
    "            movie_id = split[0]\n",
    "            movie_name = split[1]\n",
    "            sparse_tags = split[-19:]\n",
    "            movieID_name_map[int(movie_id)] = movie_name \n",
    "    return movieID_name_map\n",
    "\n",
    "            \n",
    "def get_nn_of_movie(movie_id, candidate_movie_ids, embedding_dict):\n",
    "    movie_emb = embedding_dict[movie_id]\n",
    "    min_dist = float('Inf')\n",
    "    best_id = candidate_movie_ids[0]\n",
    "    for idx, m_id in enumerate(candidate_movie_ids):\n",
    "        candidate_emb = embedding_dict[m_id]\n",
    "        curr_dist = np.linalg.norm(candidate_emb - movie_emb)\n",
    "        if curr_dist < min_dist:\n",
    "            best_id = m_id\n",
    "            min_dist = curr_dist\n",
    "    return best_id, min_dist\n",
    "\n",
    "\n",
    "def get_unique_movie_ids(data_list):\n",
    "    unique_movie_ids = set()\n",
    "    for row in data_list:\n",
    "        unique_movie_ids.add(row['in1'][0])\n",
    "    return list(unique_movie_ids)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_data_list = load_csv_data(train_path, '\\t', verbose=False)\n",
    "unique_movie_ids = get_unique_movie_ids(train_data_list)\n",
    "embedding_dict = get_movie_embedding_dict(unique_movie_ids, predictor_2)\n",
    "candidate_movie_ids = unique_movie_ids.copy()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using the script below, you can check out what is the closest movie to any movie in the data set. Last time we ran it, the closest movie to `Terminator, The (1984)` in the embedding space was `Die Hard (1988)`. Note that, the result will likely differ slightly across different runs of the algorithm, due to randomness in initialization of model parameters.\n",
    "\n",
    "- Just plug in the movie id you want to examine \n",
    "   - For example, the movie ID for Terminator is 195; you can find the movie name and ID pair in the `u.item` file\n",
    "- Note that, the result will likely differ across different runs of the algorithm, due to inherent randomness."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "movie_id_to_examine = '<movie id>' # Customize the movie ID you want to examine\n",
    "candidate_movie_ids.remove(movie_id_to_examine)\n",
    "best_id, min_dist = get_nn_of_movie(movie_id_to_examine, candidate_movie_ids, embedding_dict)\n",
    "movieID_name_map = load_movie_id_name_map('ml-100k/u.item')\n",
    "print('The closest movie to {} in the embedding space is {}'.format(movieID_name_map[movie_id_to_examine],\n",
    "                                                                  movieID_name_map[best_id]))\n",
    "candidate_movie_ids.append(movie_id_to_examine)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It is recommended to always delete the endpoints used for hosting the model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## clean up\n",
    "sess.delete_endpoint(predictor.endpoint)\n",
    "sess.delete_endpoint(predictor_2.endpoint)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
